Skip to content

Allow GCI.escapeHTML to take a custom escape table#55

Open
byroot wants to merge 1 commit into
ruby:masterfrom
byroot:escape-html-provided-list
Open

Allow GCI.escapeHTML to take a custom escape table#55
byroot wants to merge 1 commit into
ruby:masterfrom
byroot:escape-html-provided-list

Conversation

@byroot

@byroot byroot commented Feb 26, 2025

Copy link
Copy Markdown
Member

In some cases you may want to escape a string in a different way than the default behavior.

For instance, if you are trying to make some JSON safe to include in a <script> tag, you may want to escape less, and using JavaScript codepoints:

>> CGI.escapeHTML('Hello </script>', ">" => '\u003e', "<" => '\u003c', "&" => '\u0026')
=> "Hello \\u003c/script\\u003e"

Of course you can always use gsub for that, but CGI.escapeHTML being specialized is able to be very significantly faster:

ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
                gsub    82.135k i/100ms
          escapeHTML   221.405k i/100ms
Calculating -------------------------------------
                gsub    821.890k (± 2.2%) i/s    (1.22 μs/i) -      4.189M in   5.099152s
          escapeHTML      2.330M (± 0.5%) i/s  (429.22 ns/i) -     11.734M in   5.036770s

Comparison:
          escapeHTML:  2329816.5 i/s
                gsub:   821889.7 i/s - 2.83x  slower

ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
                gsub    36.235k i/100ms
          escapeHTML   171.347k i/100ms
Calculating -------------------------------------
                gsub    359.528k (± 1.5%) i/s    (2.78 μs/i) -      1.812M in   5.040422s
          escapeHTML      1.812M (± 0.7%) i/s  (551.84 ns/i) -      9.081M in   5.011762s

Comparison:
          escapeHTML:  1812105.3 i/s
                gsub:   359527.5 i/s - 5.04x  slower
require "benchmark/ips"
require "cgi"

ESCAPE = {
  ">" => '\u003e', "<" => '\u003c', "&" => '\u0026',
}

ESCAPE_B = {
  ">".b => '\u003e'.b,
  "<".b => '\u003c'.b,
  "&".b => '\u0026'.b,
}
ESCAPE_REGEX = Regexp.union(ESCAPE_B.keys)

str = ("a" * 1024).freeze
Benchmark.ips do |x|
  x.report("gsub") do
    b = str.b
    b.gsub!(ESCAPE_REGEX, ESCAPE_B)
    b.force_encoding(str.encoding)
  end

  x.report("escapeHTML") do
    CGI.escapeHTML(str, ESCAPE)
  end

  x.compare!
end

str = (("a" * 1023) + "<").freeze
Benchmark.ips do |x|
  x.report("gsub") do
    b = str.b
    b.gsub!(ESCAPE_REGEX, ESCAPE_B)
    b.force_encoding(str.encoding)
  end

  x.report("escapeHTML") do
    CGI.escapeHTML(str, ESCAPE)
  end

  x.compare!
end

NB: I haven't implemented the Java version, but can do it if there is interest in this feature.

Comment thread lib/cgi/util.rb Outdated
string = string.b
string.gsub!(/['&\"<>]/, TABLE_FOR_ESCAPE_HTML__)
if escape_table
string.gsub!(p Regexp.union(escape_table.keys), escape_table)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
string.gsub!(p Regexp.union(escape_table.keys), escape_table)
string.gsub!(Regexp.union(escape_table.keys), escape_table)

No check for the keys and the values too?

Comment thread lib/cgi/util.rb Outdated
Comment on lines +87 to +92
pattern = "[".encode(enc)
escape_table.each_key do |key|
pattern << Regexp.escape(key).encode(enc)
end
pattern << "]".encode(enc)
string = string.gsub(/#{pattern}/, table)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The keys may be longer than one here?
No check for the value encodings?

If it is OK, why not Regexp.union too?

Suggested change
pattern = "[".encode(enc)
escape_table.each_key do |key|
pattern << Regexp.escape(key).encode(enc)
end
pattern << "]".encode(enc)
string = string.gsub(/#{pattern}/, table)
string = string.gsub(Regexp.union(escape_table.keys), table)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The keys may be longer than one here?

Indeed. But is it really a problem?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No check for the value encodings?

They are encoded too:

Hash[escape_table.map {|pair| pair.map {|s|s.encode(enc)}}]

Comment thread ext/cgi/escape/escape.c Outdated
dynamic_escape_html(VALUE str, VALUE rb_escape_table)
{
VALUE escape_table[UCHAR_MAX+1] = {0};
long max_escape_length = build_escape_table(escape_table, rb_escape_table);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key lengths can be changed the methods of the latter keys.
May be better to separate filling the table and calculating the max length.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key lengths can be changed the methods of the latter keys.

I don't think I understand what you are referring to.

Comment thread ext/cgi/escape/escape.c Outdated
@byroot

byroot commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

Sorry this is an old patch, not very fresh in my head. I'll try to address your feedback.

@byroot byroot force-pushed the escape-html-provided-list branch 5 times, most recently from 653b625 to 8f73bcf Compare June 23, 2026 09:32
@byroot byroot requested a review from nobu June 23, 2026 09:33
@byroot byroot force-pushed the escape-html-provided-list branch from 8f73bcf to 57a61dc Compare June 23, 2026 09:37
In some cases you may want to escape a string in a different way than
the default behavior.

For instance, if you are trying to make some JSON save to include
in a `<script>` tag, you may want to escape less, and using JavaScript
codepoints:

```ruby
>> CGI.escapeHTML('Hello </script>', ">" => '\u003e', "<" => '\u003c', "&" => '\u0026')
=> "Hello \\u003c/script\\u003e"
```

Of course you can always use `gsub` for that, but `CGI.escapeHTML` being
specialized is able to be very significantly faster:

```
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
                gsub    82.135k i/100ms
          escapeHTML   221.405k i/100ms
Calculating -------------------------------------
                gsub    821.890k (± 2.2%) i/s    (1.22 μs/i) -      4.189M in   5.099152s
          escapeHTML      2.330M (± 0.5%) i/s  (429.22 ns/i) -     11.734M in   5.036770s

Comparison:
          escapeHTML:  2329816.5 i/s
                gsub:   821889.7 i/s - 2.83x  slower

ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
                gsub    36.235k i/100ms
          escapeHTML   171.347k i/100ms
Calculating -------------------------------------
                gsub    359.528k (± 1.5%) i/s    (2.78 μs/i) -      1.812M in   5.040422s
          escapeHTML      1.812M (± 0.7%) i/s  (551.84 ns/i) -      9.081M in   5.011762s

Comparison:
          escapeHTML:  1812105.3 i/s
                gsub:   359527.5 i/s - 5.04x  slower
```

```ruby

require "benchmark/ips"
require "cgi"

ESCAPE = {
  ">" => '\u003e', "<" => '\u003c', "&" => '\u0026',
}

ESCAPE_B = {
  ">".b => '\u003e'.b,
  "<".b => '\u003c'.b,
  "&".b => '\u0026'.b,
}
ESCAPE_REGEX = Regexp.union(ESCAPE_B.keys)

str = ("a" * 1024).freeze
Benchmark.ips do |x|
  x.report("gsub") do
    b = str.b
    b.gsub!(ESCAPE_REGEX, ESCAPE_B)
    b.force_encoding(str.encoding)
  end

  x.report("escapeHTML") do
    CGI.escapeHTML(str, ESCAPE)
  end

  x.compare!
end

str = (("a" * 1023) + "<").freeze
Benchmark.ips do |x|
  x.report("gsub") do
    b = str.b
    b.gsub!(ESCAPE_REGEX, ESCAPE_B)
    b.force_encoding(str.encoding)
  end

  x.report("escapeHTML") do
    CGI.escapeHTML(str, ESCAPE)
  end

  x.compare!
end
```
@byroot byroot force-pushed the escape-html-provided-list branch from 57a61dc to 51256f6 Compare June 23, 2026 09:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants